Introduction to Web Scraping and Data Management for Social Scientists

Session 1: Scraping Interactive Web Pages

Johannes B. Gruber

2023-07-28

Introduction

The Plan for Today

In this session, we learn how to hunt down wild data. We will:

  • Learn how to find secret APIs
  • Emulate a Browser
  • Focus specifically on step 1 of the scraping pipeline below: Request & Collect Raw Data

Original Image Source: prowebscraper.com

Philipp Pilz via unsplash.com

Request & Collect Raw Data: a closer look

Common Problems

Initially we were planning to scrape researchgate.net, since it contains self-created profiles of many researchers. However, when you try to get the HTML content, you run into an error:

library(rvest)
read_html("https://www.researchgate.net/profile/Johannes-Gruber-2")
Error in open.connection(x, "rb"): HTTP error 403.

If you don’t know what an HTTP error means, you can go to https://http.cat and have the status explained in a fun way. Below I use a little convenience function:

error_cat <- function(error) {
  link <- paste0("https://http.cat/images/", error, ".jpg")
  knitr::include_graphics(link)
}
error_cat(403)

So what’s going on?

  • If something like this happens, the server essentially refused to fulfill our request
  • This is because the website seems to have some special requirements for serving the (correct) content. These could be:
    • specific user agents
    • other specific headers
    • login through a browser cookie
  • To find out how the browser manages to get the correct response, we can use the Network tab in the inspection tool
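The first of these suspicions, the user agent, can be probed directly from R. Below is a minimal sketch using httr2; the user-agent string is just an example, and req_dry_run() shows what httr2 would send without actually performing the request:

```r
library(httr2)

# Build the request with a browser-like user agent. Whether this alone is
# enough depends on the site; researchgate.net also checks cookies and
# other headers.
req <- request("https://www.researchgate.net/profile/Johannes-Gruber-2") |>
  req_user_agent("Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36")

# show what would be sent, without sending anything
req_dry_run(req)
```

If the dry run looks right but the real request still fails, the missing ingredient is usually one of the other headers or a cookie, which is what the Network tab will reveal.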

Strategy 1: Emulate what the Browser is Doing

Open the Inspect Window Again:

But this time, we focus on the Network tab:

Here we get an overview of all the network activity of the browser and the individual requests for data that are performed. Clear the network log first and reload the page to see what is going on. Finding the right call is not always easy, but in most cases, we want:

  • a call with status 200 (OK/successful)
  • a document type
  • something that is at least a few kB in size
  • Initiator is usually “other” (we initiated the call by refreshing)

Once you have identified the call, you can right-click it -> Copy -> Copy as cURL

cURL Calls

What is cURL:

  • cURL is a command-line tool and library for making HTTP requests.
  • it is widely used for API calls from the terminal.
  • it lists the parameters of a call in a pretty readable manner:
    • the unnamed argument in the beginning is the Uniform Resource Locator (URL) the request goes to
    • -H arguments describe the headers, which are arguments sent with the call
    • -d is the data or body of a request, which is used e.g., for uploading things
    • -o/-O can be used to write the response to a file (otherwise the response is returned to the screen)
    • --compressed asks for a compressed response which is unpacked locally (saves bandwidth)
curl 'https://www.researchgate.net/profile/Johannes-Gruber-2' \
  -H 'authority: www.researchgate.net' \
  -H 'accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7' \
  -H 'accept-language: en-GB,en;q=0.9' \
  -H 'cache-control: max-age=0' \
  -H '[Redacted]' \
  -H 'sec-ch-ua: "Chromium";v="115", "Not/A)Brand";v="99"' \
  -H 'sec-ch-ua-mobile: ?0' \
  -H 'sec-ch-ua-platform: "Linux"' \
  -H 'sec-fetch-dest: document' \
  -H 'sec-fetch-mode: navigate' \
  -H 'sec-fetch-site: cross-site' \
  -H 'sec-fetch-user: ?1' \
  -H 'upgrade-insecure-requests: 1' \
  -H 'user-agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36' \
  --compressed

httr2::curl_translate()

  • We have seen httr2::curl_translate() in action yesterday
  • It can also convert more complicated API calls that make R look no different from a regular browser
  • (Remember: you need to escape all " characters in the call; press ctrl + F to open the Find & Replace tool, put " in the Find field and \" in the Replace field, and go through all matches except the first and last):
library(httr2)
httr2::curl_translate(
"curl 'https://www.researchgate.net/profile/Johannes-Gruber-2' \
  -H 'authority: www.researchgate.net' \
  -H 'accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7' \
  -H 'accept-language: en-GB,en;q=0.9' \
  -H 'cache-control: max-age=0' \
  -H 'cookie: [Redacted]' \
  -H 'sec-ch-ua: \"Chromium\";v=\"115\", \"Not/A)Brand\";v=\"99\"' \
  -H 'sec-ch-ua-mobile: ?0' \
  -H 'sec-ch-ua-platform: \"Linux\"' \
  -H 'sec-fetch-dest: document' \
  -H 'sec-fetch-mode: navigate' \
  -H 'sec-fetch-site: cross-site' \
  -H 'sec-fetch-user: ?1' \
  -H 'upgrade-insecure-requests: 1' \
  -H 'user-agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36' \
  --compressed"
)
request("https://www.researchgate.net/profile/Johannes-Gruber-2") %>% 
  req_headers(
    authority = "www.researchgate.net",
    accept = "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7",
    `accept-language` = "en-GB,en;q=0.9",
    `cache-control` = "max-age=0",
    cookie = "[Redacted]",
    `sec-ch-ua` = "\"Chromium\";v=\"115\", \"Not/A)Brand\";v=\"99\"",
    `sec-ch-ua-mobile` = "?0",
    `sec-ch-ua-platform` = "\"Linux\"",
    `sec-fetch-dest` = "document",
    `sec-fetch-mode` = "navigate",
    `sec-fetch-site` = "cross-site",
    `sec-fetch-user` = "?1",
    `upgrade-insecure-requests` = "1",
    `user-agent` = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36",
  ) %>% 
  req_perform()
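As an alternative to manual Find & Replace, the quote escaping can be done programmatically before pasting. A small sketch; escape_quotes() is a made-up helper, not part of httr2:

```r
# Hypothetical helper: escape every double quote inside a copied cURL call
# so the whole command can then be wrapped in "..." and passed to
# httr2::curl_translate().
escape_quotes <- function(cmd) {
  gsub('"', '\\"', cmd, fixed = TRUE)
}

cat(escape_quotes('-H \'sec-ch-ua: "Chromium";v="115"\''))
#> -H 'sec-ch-ua: \"Chromium\";v=\"115\"'
```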

‘Emulating’ the Browser Request

request("https://www.researchgate.net/profile/Johannes-Gruber-2") |>
  req_headers(
    authority = "www.researchgate.net",
    accept = "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7",
    `accept-language` = "en-GB,en;q=0.9",
    `cache-control` = "max-age=0",
    cookie = "[Redacted]",
    `sec-ch-ua` = "\"Chromium\";v=\"115\", \"Not/A)Brand\";v=\"99\"",
    `sec-ch-ua-mobile` = "?0",
    `sec-ch-ua-platform` = "\"Linux\"",
    `sec-fetch-dest` = "document",
    `sec-fetch-mode` = "navigate",
    `sec-fetch-site` = "cross-site",
    `sec-fetch-user` = "?1",
    `upgrade-insecure-requests` = "1",
    `user-agent` = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36",
  ) |>
  req_perform()

This used to work quite well when I prepared the slides, but suddenly stopped working over the weekend. So I removed the rest of the slides about it…

Example: ICA (International Communication Association) 2023 Conference

What do we want

  • General goal in the course: we want to build a database of conference attendance and link this to researchers
  • So for each website:
    • Speakers
    • (Co-)authors
    • Paper/talk titles
    • Panel (to see who was in the same ones)

Trying to scrape the programme

  • The page looks straightforward enough!
  • There is a “Conference Schedule” with links to the individual panels
  • The table has a pretty nice class by which we can select it: class="agenda-content"
html <- read_html("https://www.icahdq.org/mpage/ICA23-Program")
Error in open.connection(x, "rb"): HTTP error 403.

Let’s Check our Network Tab

  • I noticed a request that takes quite long and retrieves a relatively large object (500kB)
  • Clicking on it opens another window showing the response
  • Wait, is this a json with the entire conference schedule?

Translating the cURL call

curl_translate("curl 'https://whova.com/xems/apis/event_webpage/agenda/public/get_agendas/?event_id=JcQAdK91J0qLUtNxOYUVWFMTUuQgIg3Xj6VIeeyXVR4%3D' \
  -H 'Accept: application/json, text/plain, */*' \
  -H 'Accept-Language: en-GB,en-US;q=0.9,en;q=0.8' \
  -H 'Cache-Control: no-cache' \
  -H 'Connection: keep-alive' \
  -H 'Pragma: no-cache' \
  -H 'Referer: https://whova.com/embedded/event/JcQAdK91J0qLUtNxOYUVWFMTUuQgIg3Xj6VIeeyXVR4%3D/' \
  -H 'Sec-Fetch-Dest: empty' \
  -H 'Sec-Fetch-Mode: cors' \
  -H 'Sec-Fetch-Site: same-origin' \
  -H 'User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36' \
  -H 'sec-ch-ua: \"Chromium\";v=\"115\", \"Not/A)Brand\";v=\"99\"' \
  -H 'sec-ch-ua-mobile: ?0' \
  -H 'sec-ch-ua-platform: \"Linux\"' \
  --compressed")
request("https://whova.com/xems/apis/event_webpage/agenda/public/get_agendas/?event_id=JcQAdK91J0qLUtNxOYUVWFMTUuQgIg3Xj6VIeeyXVR4%3D") %>% 
  req_headers(
    Accept = "application/json, text/plain, */*",
    `Accept-Language` = "en-GB,en-US;q=0.9,en;q=0.8",
    `Cache-Control` = "no-cache",
    Connection = "keep-alive",
    Pragma = "no-cache",
    Referer = "https://whova.com/embedded/event/JcQAdK91J0qLUtNxOYUVWFMTUuQgIg3Xj6VIeeyXVR4%3D/",
    `Sec-Fetch-Dest` = "empty",
    `Sec-Fetch-Mode` = "cors",
    `Sec-Fetch-Site` = "same-origin",
    `User-Agent` = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36",
    `sec-ch-ua` = "\"Chromium\";v=\"115\", \"Not/A)Brand\";v=\"99\"",
    `sec-ch-ua-mobile` = "?0",
    `sec-ch-ua-platform` = "\"Linux\"",
  ) %>% 
  req_perform()

Requesting the json (?)

ica_data <- request("https://whova.com/xems/apis/event_webpage/agenda/public/get_agendas/?event_id=JcQAdK91J0qLUtNxOYUVWFMTUuQgIg3Xj6VIeeyXVR4%3D") |> 
  req_headers(
    Accept = "application/json, text/plain, */*",
    `Accept-Language` = "en-GB,en-US;q=0.9,en;q=0.8",
    `Cache-Control` = "no-cache",
    Connection = "keep-alive",
    Pragma = "no-cache",
    Referer = "https://whova.com/embedded/event/JcQAdK91J0qLUtNxOYUVWFMTUuQgIg3Xj6VIeeyXVR4%3D/",
    `Sec-Fetch-Dest` = "empty",
    `Sec-Fetch-Mode` = "cors",
    `Sec-Fetch-Site` = "same-origin",
    `User-Agent` = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36",
    `sec-ch-ua` = "\"Chromium\";v=\"115\", \"Not/A)Brand\";v=\"99\"",
    `sec-ch-ua-mobile` = "?0",
    `sec-ch-ua-platform` = "\"Linux\"",
  ) |> 
  req_perform() |> 
  resp_body_json()
object.size(ica_data) |> 
  format("MB")
[1] "9.4 Mb"

It worked!

Wrangling with Json

  • This json file or the R object it produces is quite intimidating.
  • To get to a certain panel on the fourth day, for example, we have to enter this insane path:
ica_data[["data"]][["agenda"]][[4]][["time_ranges"]][[3]][[2]][[65]][[1]][["sessions"]][[1]]
$id
[1] 3113186

$name
[1] "Race, Ethnicity, and Religion: Media Coverage, Messages, and Reactions"

$event_id
[1] "aic1_202305"

$start_time
[1] "09:00"

$end_time
[1] "10:15"

$calendar_stime
[1] "2023-05-28 09:00:00"

$calendar_etime
[1] "2023-05-28 10:15:00"

$place
[1] "M - Chestnut East"

$desc
[1] "<br /><br /><b>Papers: </b><br />Campaign Outreach or (His)Pandering?: Politician Spanish Usage in Media and Latino Voters<br /><i>Guadalupe Madrigal, U of Missouri - Columbia</i><br /><i>Angela Ocampo, U of Texas at Austin</i><br /><br />When a Pandemic Converted to Islamophobia: Indian News in the Time of Covid-19<br /><i>Arshad Amanullah, National U of Singapore</i><br /><i>Arif Nadaf, Islamic U of Science & Technology, Kashmir, India</i><br /><i>Taberez Neyazi, National U of Singapore</i><br /><br />Blaming Asians for Coronavirus: The Role of Valenced Framing and Discrete Emotions in Hostile Media Effect<br /><i>Juan Liu, Towson U</i><br /><br />Partisanship Supersedes Race: Effects of Discussant Race and Partisanship on Whites’ Willingness to Engage in Race-Specific Conversations<br /><i>Osei Appiah, The Ohio State U</i><br /><i>William Eveland, The Ohio State U</i><br /><i>Christina Henry, The Ohio State U</i><br /><br />Examining Racial Differences in Concerns About Online Polarization<br /><i>Cara Schumann, U of North Carolina at Chapel Hill</i><br /><i>Shannon McGregor, U of North Carolina at Chapel Hill</i> <a href='https://ica2023.cadmore.media/object/451982' style='text-decoration: none; background-color: #789F90; color: #FFFFFF; padding: 5px 10px; border: 1px solid #789F90; border-radius: 15px;'>Open Session</a><br /><br />"

$extra
$extra$docs
list()

$extra$live_stream
$extra$live_stream$url
[1] ""


$extra$recorded_video
$extra$recorded_video$url
[1] ""


$extra$order
[1] 791

$extra$type
[1] "Session"

$extra$rate_enabled
[1] TRUE

$extra$session_feedback_enable
[1] TRUE


$docs
list()

$session_order
[1] 791

$session_feedback_enable
[1] TRUE

$live_stream
$live_stream$url
[1] ""


$recorded_video
$recorded_video$url
[1] ""


$upload_video
NULL

$simulive_upload_video
NULL

$speaker
named list()

$expand
[1] "yes"

$speaker_label
[1] "Session chair"

$type
[1] 1

$sponsors
list()

$programs
list()

$tracks
$tracks[[1]]
$tracks[[1]]$name
[1] "In Person"

$tracks[[1]]$id
[1] 539417

$tracks[[1]]$color
[1] "#5C6BC0"


$tracks[[2]]
$tracks[[2]]$name
[1] "Political Communication"

$tracks[[2]]$id
[1] 540044

$tracks[[2]]$color
[1] "#a15284"



$tags
list()

  • Essentially, someone pressed a relational database into a list format and we now have to scramble to cope with this monstrosity

Parsing the Json

I could not come up with a better method so far. The only way I found to extract the data is a nested for loop that goes through all days and all entries in the object, looking for elements called "sessions".

library(tidyverse, warn.conflicts = FALSE)
sessions <- list()

for (day in 1:5) {
  
  times <- ica_data[["data"]][["agenda"]][[day]][["time_ranges"]]
  
  for (l_one in seq_along(pluck(times))) {
    for (l_two in seq_along(pluck(times, l_one))) {
      for (l_three in seq_along(pluck(times, l_one, l_two))) {
        for (l_four in seq_along(pluck(times, l_one, l_two, l_three))) {
          
          session <- pluck(times, l_one, l_two, l_three, l_four, "sessions", 1)
          id <- pluck(session, "id")
          if (!is.null(id)) {
            id <- as.character(id)
            sessions[[id]] <- session
          }
          
        }
      }
    }
  }
}
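The hard-coded loop depth can also be replaced by a recursive helper. The sketch below runs on a toy nested list (not the real ica_data object, whose structure is much larger) and collects everything stored under a "sessions" key at any depth:

```r
collect_sessions <- function(x) {
  if (!is.list(x)) return(list())
  # grab a "sessions" element if this level has one
  found <- if ("sessions" %in% names(x)) x[["sessions"]] else list()
  # then recurse into all children and append what they find
  c(found, unlist(lapply(x, collect_sessions), recursive = FALSE))
}

# toy data mimicking the varying nesting depth
toy <- list(
  day1 = list(sessions = list(list(id = 1))),
  day2 = list(slot = list(sessions = list(list(id = 2))))
)
length(collect_sessions(toy))  # 2
```

The advantage is that the helper does not care how deeply each day nests its time ranges, which is exactly what makes the explicit loop version brittle.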

Parsing the Json data

ica_data_df <- tibble(
  panel_id = map_int(sessions, "id"),
  panel_name = map_chr(sessions, "name"),
  time = map_chr(sessions, "calendar_stime"),
  desc = map_chr(sessions, function(s) pluck(s, "desc", .default = NA))
)
ica_data_df
# A tibble: 881 × 4
   panel_id panel_name                                               time  desc 
      <int> <chr>                                                    <chr> <chr>
 1  3113155 PRECONFERENCE: Games and the (Playful) Future of Commun… 2023… "Rec…
 2  3113156 PRECONFERENCE: Generation Z and Global Communication     2023… "Gen…
 3  3113166 PRECONFERENCE: Nothing About Us, Without Us: Authentic … 2023… "Thi…
 4  3113172 PRECONFERENCE: Reimagining the Field of Media, War and … 2023… "As …
 5  3113175 PRECONFERENCE: The Legacies of Elihu Katz                2023… "Eli…
 6  3112705 Human-Machine Preconference Breakout (room 2)            2023…  <NA>
 7  3113080 New Avoidance Preconference Breakout (room 2)            2023…  <NA>
 8  3113150 PRECONFERENCE: 12th Annual Doctoral Consortium of the C… 2023… "The…
 9  3113154 PRECONFERENCE: Ethics of Critically Interrogating and R… 2023… "The…
10  3113158 PRECONFERENCE: Human-Machine Communication: Authenticit… 2023… "The…
# ℹ 871 more rows

Extracting paper title and authors

Finally we want to parse the HTML in the description column.

ica_data_df$desc[100]
3113023 
"<br /><br /><b>Participants: </b><br /><b><i>(Chairs) </i></b>Wayne Xu, U of Massachusetts Amherst<br /><br /><b>Papers: </b><br />Disentangling the Longitudinal Relationship Between Social Media Use, Political Expression and Political Participation: What Do We Really Know?<br /><i>Jörg Matthes, U of Vienna</i><br /><i>Andreas Nanz, U of Vienna</i><br /><i>Marlis Stubenvoll, U of Vienna</i><br /><i>Ruta Kaskeleviciute, U of Vienna</i><br /><br />Political Discussions on Russian YouTube: How Did They Change Since the Start of the War in Ukraine?<br /><i>Ekaterina Romanova, U of Florida</i><br /><br />Perceptions of and Reactions to Different Types of Incivility in Public Online Discussions: Results of an Online Experiment<br /><i>Marike Bormann, Unviersity of Düsseldorf</i><br /><i>Dominique Heinbach, Heinrich-Heine-U</i><br /><i>Jan Kluck, U of Duisburg-Essen</i><br /><i>Marc Ziegele, Heinrich Heine U</i><br /><br />When Trust in AI Mediates: AI News Use, Public Discussion, and Civic Participation<br /><i>Seungahn Nah, U of Florida</i><br /><i>Chun Shao, Arizona State U</i><br /><i>Ekaterina Romanova, U of Florida</i><br /><i>Gwiwon Nam, U of Florida</i><br /><i>Fanjue Liu, U of Florida</i> <a href='https://ica2023.cadmore.media/object/451094' style='text-decoration: none; background-color: #789F90; color: #FFFFFF; padding: 5px 10px; border: 1px solid #789F90; border-radius: 15px;'>Open Session</a><br /><br />" 

We can inspect one of the descriptions using the same function as in session 3:

check_in_browser <- function(html) {
  tmp <- tempfile(fileext = ".html")
  writeLines(as.character(html), tmp)
  browseURL(tmp)
}
check_in_browser(ica_data_df$desc[100])

Extracting paper title and authors using a function

I wrote another function for this. You can check some of the panels using the browser: check_in_browser(ica_data_df$desc[100]).

pull_papers <- function(desc) {
  # we extract the html code starting with the papers line
  papers <- str_extract(desc, "<b>Papers: </b>.+$") |> 
    str_remove("<b>Papers: </b><br />") |> 
    # we split the html by double line breaks, since it is not properly formatted as paragraphs
    strsplit("<br /><br />", fixed = TRUE) |> 
    pluck(1)
  
  
  # if there is no html code left, just return NAs
  if (all(is.na(papers))) {
    return(list(list(paper_title = NA, authors = NA)))
  } else {
    # otherwise we loop through each paper
    map(papers, function(t) {
      html <- read_html(t)
      
      # first line is the title
      title <- html |> 
        html_text2() |> 
        str_extract("^.+\n")
      
      # at least the authors are formatted in italics
      authors <- html_elements(html, "i") |> 
        html_text2()
      
      list(paper_title = title, authors = authors)
    })
  }
}
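The string operations at the heart of pull_papers() can be checked in isolation on a toy description; the HTML below is made up, mimicking the structure of the real desc fields:

```r
library(stringr)

toy_desc <- paste0(
  "<b>Papers: </b><br />Title One<br /><i>A. Author, Example U</i><br /><br />",
  "Title Two<br /><i>B. Author, Example U</i><br /><br />"
)

# isolate the papers section and split on the double line break
papers <- str_extract(toy_desc, "<b>Papers: </b>.+$") |>
  str_remove("<b>Papers: </b><br />") |>
  strsplit("<br /><br />", fixed = TRUE) |>
  unlist()

length(papers)  # 2, one entry per paper
```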

Now we have all the information we wanted:

ica_data_df_tidy <- ica_data_df |> 
  slice(-613) |> 
  mutate(papers = map(desc, pull_papers)) |> 
  unnest(papers) |> 
  unnest_wider(papers) |> 
  unnest(authors) |> 
  select(-desc) |> 
  filter(!is.na(authors))
ica_data_df_tidy
# A tibble: 8,169 × 5
   panel_id panel_name                            time       paper_title authors
      <int> <chr>                                 <chr>      <chr>       <chr>  
 1  3113249 The Powers of Platforms               2023-05-2… "Serve the… Changw…
 2  3113249 The Powers of Platforms               2023-05-2… "Serve the… Ziyi W…
 3  3113249 The Powers of Platforms               2023-05-2… "Serve the… Joel G…
 4  3113249 The Powers of Platforms               2023-05-2… "Empowered… Andrea…
 5  3113249 The Powers of Platforms               2023-05-2… "Empowered… Jacob …
 6  3113249 The Powers of Platforms               2023-05-2… "The Rise … Guy Ho…
 7  3113249 The Powers of Platforms               2023-05-2… "Google Ne… Lucia …
 8  3113249 The Powers of Platforms               2023-05-2… "Google Ne… Mathia…
 9  3113249 The Powers of Platforms               2023-05-2… "Google Ne… Amalia…
10  3112411 Affiliate Journals Top Papers Session 2023-05-2… "One Year … Eloria…
# ℹ 8,159 more rows

Exercises 1

  1. Open the ICA site in your browser and inspect the network traffic. Can you identify the call to the programme json?

  2. I excluded panel 613 since the function fails on it. Investigate what the problem is.

Example: 2023 APSA Annual Meeting & Exhibition

What do we want

  • General goal in the course: we want to build a database of conference attendance and link this to researchers
  • So for each website:
    • Speakers
    • (Co-)authors
    • Paper/talk titles
    • Panel (to see who was in the same ones)

Let’s explore the site a little together

  1. Find an overview of sessions
  2. Find details about each session
  3. Find details about each talk

Inspect the retrieved HTML

The object html is not easy to inspect: it contains HTML code not made for human eyes, and the output is truncated when printed.

We can reuse the function from before, which converts the rvest object to a character string and displays its content in a browser:

check_in_browser <- function(html) {
  tmp <- tempfile(fileext = ".html")
  writeLines(as.character(html), tmp)
  browseURL(tmp)
}
check_in_browser(html)

So what’s going on?

  • If something like this happens, the server essentially did not fulfill our request
  • Instead of giving us an error (like the 403 we saw before) it simply delivers us something and reports: OK
  • This is because the website seems to have some special requirements for serving the correct content. These could be
    • specific user agents
    • other specific headers
    • login through a browser cookie
  • To find out how the browser manages to get the correct response (with all the links), we can use the Network tab in the inspection tool again

Recording the network traffic

The APSA site essentially uses a hidden API

Following the same strategy as before:

  1. Copy the network call that gets our content from the browser
  2. Paste it into httr2::curl_translate() (make sure to escape mischievous " characters)
httr2::curl_translate("curl 'https://convention2.allacademic.com/one/apsa/apsa23/index.php?cmd=Online+Program+View+Selected+Day+Submissions&selected_day=2023-08-31&program_focus=browse_by_day_submissions&PHPSESSID=fvjf6ltd4o45kgpv2occcrr0al' \
  -H 'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7' \
  -H 'Accept-Language: en-GB,en-US;q=0.9,en;q=0.8' \
  -H 'Cache-Control: no-cache' \
  -H 'Connection: keep-alive' \
  -H 'Cookie: 9s7asg63fpouugut6m5m2vj36r[msg]=e52640799a6bbcebac16c0205ffc2cd9; fvjf6ltd4o45kgpv2occcrr0al[msg]=999aa7691451c5d15ddf91ee0a902f3b; _ga=GA1.2.2046361133.1690277724; _gid=GA1.2.499473362.1690277724; monster[/one/apsa/apsa23/][fvjf6ltd4o45kgpv2occcrr0al][created]=1690532022; _gat=1; _gat_extraTracker=1; _ga_79KQXM4T08=GS1.2.1690530570.6.1.1690532023.0.0.0; _ga_JWPT5JHJ1E=GS1.2.1690530570.6.1.1690532024.0.0.0' \
  -H 'Pragma: no-cache' \
  -H 'Referer: https://convention2.allacademic.com/one/apsa/apsa23/index.php?cmd=Online+Program+Load+Focus&program_focus=browse_by_day_submissions&PHPSESSID=fvjf6ltd4o45kgpv2occcrr0al' \
  -H 'Sec-Fetch-Dest: document' \
  -H 'Sec-Fetch-Mode: navigate' \
  -H 'Sec-Fetch-Site: same-origin' \
  -H 'Sec-Fetch-User: ?1' \
  -H 'Upgrade-Insecure-Requests: 1' \
  -H 'User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36' \
  -H 'sec-ch-ua: \"Chromium\";v=\"115\", \"Not/A)Brand\";v=\"99\"' \
  -H 'sec-ch-ua-mobile: ?0' \
  -H 'sec-ch-ua-platform: \"Linux\"' \
  --compressed")
request("https://convention2.allacademic.com/one/apsa/apsa23/index.php?cmd=Online+Program+View+Selected+Day+Submissions&selected_day=2023-08-31&program_focus=browse_by_day_submissions&PHPSESSID=fvjf6ltd4o45kgpv2occcrr0al") %>% 
  req_headers(
    Accept = "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7",
    `Accept-Language` = "en-GB,en-US;q=0.9,en;q=0.8",
    `Cache-Control` = "no-cache",
    Connection = "keep-alive",
    Cookie = "9s7asg63fpouugut6m5m2vj36r[msg]=e52640799a6bbcebac16c0205ffc2cd9; fvjf6ltd4o45kgpv2occcrr0al[msg]=999aa7691451c5d15ddf91ee0a902f3b; _ga=GA1.2.2046361133.1690277724; _gid=GA1.2.499473362.1690277724; monster[/one/apsa/apsa23/][fvjf6ltd4o45kgpv2occcrr0al][created]=1690532022; _gat=1; _gat_extraTracker=1; _ga_79KQXM4T08=GS1.2.1690530570.6.1.1690532023.0.0.0; _ga_JWPT5JHJ1E=GS1.2.1690530570.6.1.1690532024.0.0.0",
    Pragma = "no-cache",
    Referer = "https://convention2.allacademic.com/one/apsa/apsa23/index.php?cmd=Online+Program+Load+Focus&program_focus=browse_by_day_submissions&PHPSESSID=fvjf6ltd4o45kgpv2occcrr0al",
    `Sec-Fetch-Dest` = "document",
    `Sec-Fetch-Mode` = "navigate",
    `Sec-Fetch-Site` = "same-origin",
    `Sec-Fetch-User` = "?1",
    `Upgrade-Insecure-Requests` = "1",
    `User-Agent` = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36",
    `sec-ch-ua` = "\"Chromium\";v=\"115\", \"Not/A)Brand\";v=\"99\"",
    `sec-ch-ua-mobile` = "?0",
    `sec-ch-ua-platform` = "\"Linux\"",
  ) %>% 
  req_perform()
  3. Run the resulting httr2 code and check if we get the right content
html <- request("https://convention2.allacademic.com/one/apsa/apsa23/index.php?cmd=Online+Program+View+Selected+Day+Submissions&selected_day=2023-08-31&program_focus=browse_by_day_submissions&PHPSESSID=fvjf6ltd4o45kgpv2occcrr0al")  |>  
  req_headers(
    Cookie = "9s7asg63fpouugut6m5m2vj36r[msg]=e52640799a6bbcebac16c0205ffc2cd9; fvjf6ltd4o45kgpv2occcrr0al[msg]=999aa7691451c5d15ddf91ee0a902f3b; _ga=GA1.2.2046361133.1690277724; _gid=GA1.2.499473362.1690277724; monster[/one/apsa/apsa23/][fvjf6ltd4o45kgpv2occcrr0al][created]=1690532022; _gat=1; _gat_extraTracker=1; _ga_79KQXM4T08=GS1.2.1690530570.6.1.1690532023.0.0.0; _ga_JWPT5JHJ1E=GS1.2.1690530570.6.1.1690532024.0.0.0",
    Referer = "https://convention2.allacademic.com/one/apsa/apsa23/index.php?cmd=Online+Program+Load+Focus&program_focus=browse_by_day_submissions&PHPSESSID=fvjf6ltd4o45kgpv2occcrr0al",
  ) |> 
  req_perform() |>
  # we add this part to extract the html from the response
  resp_body_html()
panels <- html |> 
  html_elements("li a") |> 
  html_text2()

panel_links <- html |> 
  html_elements("li a") |> 
  html_attr("href")

panels <- tibble(panels, panel_links) |> 
  filter(str_detect(panel_links, "selected_session_id="))
  4. Adapt the httr2 call to make it usable for requesting other data

Wrapping the secret APSA API

After some investigation, I noticed that the API returns the right information when the call includes three things:

  • A valid session ID (PHPSESSID query parameter)
  • A valid Referer header containing the same session ID
  • A Cookie string which matches the session ID
request_apsa <- function(url,
                       sess_id = NULL,
                       cookies = "9s7asg63fpouugut6m5m2vj36r[msg]=e52640799a6bbcebac16c0205ffc2cd9; fvjf6ltd4o45kgpv2occcrr0al[msg]=999aa7691451c5d15ddf91ee0a902f3b; _ga=GA1.2.2046361133.1690277724; _gid=GA1.2.499473362.1690277724; monster[/one/apsa/apsa23/][fvjf6ltd4o45kgpv2occcrr0al][created]=1690532022; _gat=1; _gat_extraTracker=1; _ga_79KQXM4T08=GS1.2.1690530570.6.1.1690532023.0.0.0; _ga_JWPT5JHJ1E=GS1.2.1690530570.6.1.1690532024.0.0.0") {
  
  # extract the session id from the URL if not supplied
  if (is.null(sess_id)) {
    sess_id <- str_extract(url, "&PHPSESSID=[a-z0-9]+(&|$)")
  }
  referer <- paste0(
    "https://convention2.allacademic.com/one/apsa/apsa23/index.php?cmd=Online+Program+Load+Focus&program_focus=browse_by_day_submissions", 
    sess_id
  )
  
  request(url) |> 
    req_headers(
      Referer = referer,
      Cookie = cookies
    ) |> 
    # let's set a cautious rate (6 requests per minute) in case they check for scraping
    req_throttle(6 / 60) |> 
    req_perform() |> 
    resp_body_html()
}
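We can sanity-check the session-id extraction used inside request_apsa() on a made-up URL (the id abc123 is fake). Note that the leading & is kept on purpose, since the result is pasted straight onto the referer query string:

```r
library(stringr)

fake_url <- paste0(
  "https://convention2.allacademic.com/one/apsa/apsa23/index.php",
  "?cmd=Online+Program+Load+Focus&PHPSESSID=abc123"
)
str_extract(fake_url, "&PHPSESSID=[a-z0-9]+(&|$)")
#> [1] "&PHPSESSID=abc123"
```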

Testing the function

Let’s test this on a panel:

panel_3_html <- request_apsa(panels$panel_links[3])
check_in_browser(panel_3_html)

Write some code to parse the panel

Luckily, the HTML is quite clean and easy to parse with the tools we’ve learned already:

panel_title <- panel_3_html |> 
  html_element("h3") |> 
  html_text2()

panel_description <- panel_3_html |> 
  html_element("blockquote") |> 
  html_text2()

paper_urls <- panel_3_html |> 
  html_elements("li a") |> 
  html_attr("href")

paper_description <- panel_3_html |> 
  html_elements("li a") |> 
  html_text2()

tibble(paper_description, paper_urls) |> 
  # we collected some trash, but can filter it out easily using the URL
  filter(str_detect(paper_urls, "selected_paper_id=")) |> 
  # We separate paper title and authors from each other
  separate(paper_description, into = c("paper", "authors"), sep = " - ") |> 
  # If there are several authors they are divided by ; (we split them up)
  mutate(author = strsplit(authors, split = "; ")) |>
  # pull the list out into a long format
  unnest(author) |> 
  # and add some information from above
  mutate(panel_title = panel_title,
         paper_description = panel_description)
# A tibble: 5 × 6
  paper                  authors paper_urls author panel_title paper_description
  <chr>                  <chr>   <chr>      <chr>  <chr>       <chr>            
1 "Anger-Driven Misinfo… Cengiz… https://c… Cengi… Immigratio… These papers con…
2 "Anger-Driven Misinfo… Cengiz… https://c… Sofia… Immigratio… These papers con…
3 "Kids in Cages: When … Frank … https://c… Frank… Immigratio… These papers con…
4 "Kids in Cages: When … Frank … https://c… Allis… Immigratio… These papers con…
5 "From “Illegal” to “U… Jacob … https://c… Jacob… Immigratio… These papers con…
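The separate + strsplit + unnest pattern used above can be checked in isolation on a made-up description string:

```r
library(tibble)
library(tidyr)
library(dplyr)

toy <- tibble(paper_description = "A Fake Paper - Jane Doe; John Roe")

toy_long <- toy |>
  # split the string into paper title and authors
  separate(paper_description, into = c("paper", "authors"), sep = " - ") |>
  # several authors are divided by "; ", split them into a list column
  mutate(author = strsplit(authors, split = "; ")) |>
  # pull the list column out into long format: one row per author
  unnest(author)

toy_long$author
#> [1] "Jane Doe" "John Roe"
```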

Let’s wrap this in a function

We combine the request for a panel’s html and the parsing in one function:

scrape_panel <- function(url) {
  sess_id <- str_extract(url, "(?<=selected_session_id=)\\d+")
  message("Requesting session ", sess_id)
  # request the URL with our request function
  html <- request_apsa(url)
  
  
  # Running the parser
  title <- html |> 
    html_element("h3") |> 
    html_text2()
  
  description <- html |> 
    html_element("blockquote") |> 
    html_text2()
  
  paper_urls <- html |> 
    html_elements("li a") |> 
    html_attr("href")
  
  paper_description <- html |> 
    html_elements("li a") |> 
    html_text2()
  
  tibble(paper_description, paper_urls) |> 
    filter(str_detect(paper_urls, "selected_paper_id=")) |> 
    separate(paper_description, into = c("paper", "authors"), sep = " - ") |> 
    mutate(author = strsplit(authors, split = ";")) |> 
    unnest(author) |> 
    mutate(panel_title = title,
           panel_description = description)
}

scrape_panel(panels$panel_links[4])
# A tibble: 7 × 6
  paper                  authors paper_urls author panel_title panel_description
  <chr>                  <chr>   <chr>      <chr>  <chr>       <chr>            
1 Winning Elections wit… Shusei… https://c… "Shus… Dominant P… "An enduring puz…
2 Winning Elections wit… Shusei… https://c… " Yus… Dominant P… "An enduring puz…
3 Winning Elections wit… Shusei… https://c… " Shi… Dominant P… "An enduring puz…
4 Winning Elections wit… Shusei… https://c… " Dan… Dominant P… "An enduring puz…
5 A Theory of Group-Bas… Amy Lo… https://c… "Amy … Dominant P… "An enduring puz…
6 In-Group Anger or Out… Shikha… https://c… "Shik… Dominant P… "An enduring puz…
7 Reelection Can Increa… Lucia … https://c… "Luci… Dominant P… "An enduring puz…

Adding some caching

  • We identified 455 panels on a single day of APSA.
  • So we will have to make many requests in a loop
  • If the loop breaks, all progress is gone :(
  • To prevent that, we should build some caching into the function
scrape_panel <- function(url,
                         cache_dir = "../data/apsa2023/") {
  
  # extract the session ID from the URL (also used in the message below)
  sess_id <- str_extract(url, "(?<=selected_session_id=)\\d+")
  
  # the default is an empty file name
  f_name <- ""
  
  # If cache_dir is not NULL, a file name is constructed
  if (!is.null(cache_dir)) {
    # make sure the cache folder is created if it does not exist
    dir.create(cache_dir, showWarnings = FALSE)
    # use the session ID to construct the file path for saving
    f_name <- file.path(cache_dir, paste0(sess_id, ".rds"))
  }
  
  # if the cache file already exists, we can skip this session :)
  if (!file.exists(f_name)) {
    message("Requesting session ", sess_id)
    html <- request_apsa(url)
    
    title <- html |> 
      html_element("h3") |> 
      html_text2()
    
    description <- html |> 
      html_element("blockquote") |> 
      html_text2()
    
    paper_urls <- html |> 
      html_elements("li a") |> 
      html_attr("href")
    
    paper_description <- html |> 
      html_elements("li a") |> 
      html_text2()
    
    out <- tibble(paper_description, paper_urls) |> 
      filter(str_detect(paper_urls, "selected_paper_id=")) |> 
      separate(paper_description, into = c("paper", "authors"), sep = " - ") |> 
      mutate(author = strsplit(authors, split = ";")) |> 
      unnest(author) |> 
      mutate(panel_title = title,
             panel_description = description)
    if (!is.null(cache_dir)) {
      saveRDS(out, f_name)
    }
  } else {
    # If the cache file exists, we read the cached panel data instead
    out <- readRDS(f_name)
  }
  
  out
}

scrape_panel(panels$panel_links[4])
# A tibble: 7 × 6
  paper                  authors paper_urls author panel_title panel_description
  <chr>                  <chr>   <chr>      <chr>  <chr>       <chr>            
1 Winning Elections wit… Shusei… https://c… "Shus… Dominant P… "An enduring puz…
2 Winning Elections wit… Shusei… https://c… " Yus… Dominant P… "An enduring puz…
3 Winning Elections wit… Shusei… https://c… " Shi… Dominant P… "An enduring puz…
4 Winning Elections wit… Shusei… https://c… " Dan… Dominant P… "An enduring puz…
5 A Theory of Group-Bas… Amy Lo… https://c… "Amy … Dominant P… "An enduring puz…
6 In-Group Anger or Out… Shikha… https://c… "Shik… Dominant P… "An enduring puz…
7 Reelection Can Increa… Lucia … https://c… "Luci… Dominant P… "An enduring puz…

Much quicker, since I’ve done this before!

Let’s bring it all together

We loop over the days of APSA to collect all links:

days <- seq(as.Date("2023-08-30"), as.Date("2023-09-03"), 1)
panel_links <- map(days, function(d) {
  # note: the PHPSESSID in this URL comes from the author's browser session
  # and will expire; replace it with your own
  html <- request_apsa(
    paste0("https://convention2.allacademic.com/one/apsa/apsa23/index.php?cmd=Online+Program+View+Selected+Day+Submissions&selected_day=",
    d,
    "&program_focus=browse_by_day_submissions&PHPSESSID=fvjf6ltd4o45kgpv2occcrr0al"
    ))
  
  html |> 
    html_elements("li a") |> 
    html_attr("href") |> 
    str_subset("session_id")
}) |> 
  unlist()
length(panel_links)
[1] 1574

And now we iterate over these links to collect all panel data:

apsa_data <- map(panel_links, scrape_panel) |> 
  bind_rows()
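Even with caching, a single failed request would still abort the whole `map()` call. One way to guard against this (a sketch, not part of the original code; `flaky_scraper` is a toy stand-in for `scrape_panel`) is `purrr::possibly()`, which replaces errors with a default value so the loop can continue:

```r
library(purrr)

# toy stand-in for scrape_panel(): fails on one input to simulate a broken request
flaky_scraper <- function(url) {
  if (url == "bad") stop("HTTP error 403")
  paste("scraped:", url)
}

# possibly() returns `otherwise` instead of throwing, so map() keeps going
safe_scraper <- possibly(flaky_scraper, otherwise = NULL)

results <- map(c("a", "bad", "c"), safe_scraper) |> 
  compact()  # drop the NULLs left by failed requests
```

In the real loop you would wrap `scrape_panel` the same way; thanks to the cached `.rds` files, a re-run then only repeats the requests that failed.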

We make sure to save the combined data:

saveRDS(apsa_data, "../data/apsa_2023_data.rds")

And let’s check the most prolific authors again:

apsa_data |> 
  count(author, sort = TRUE)
# A tibble: 7,309 × 2
   author                                                                      n
   <chr>                                                                   <int>
 1 " Jonathan Nagler, New York University"                                     8
 2 " Joshua A. Tucker, New York University"                                    8
 3 " Baekkwan Park, University of Missouri"                                    5
 4 " Carl Henrik Knutsen, Department of Political Science, University of …     5
 5 " Fabrizio Gilardi, University of Zurich"                                   5
 6 " Peter Loewen, University of Toronto"                                      5
 7 " Aykut Ozturk, University of Glasgow"                                      4
 8 " Beatrice Magistro, California Institute of Technology"                    4
 9 " Geoffrey Sheagley, University of Georgia"                                 4
10 " Maël Dominique Kubli, University of Zurich"                               4
# ℹ 7,299 more rows
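Note that the author names above still carry the leading spaces left over from `strsplit(authors, split = ";")`, which can split one person into two separate counts. A small cleaning step with `stringr::str_trim()` avoids this (a sketch on toy data, since `apsa_data` requires the scrape above):

```r
library(dplyr)
library(stringr)
library(tibble)

# toy data mimicking the whitespace artifact from splitting on ";"
toy <- tibble(author = c("Jane Doe, Uni A", " Jane Doe, Uni A", " John Smith, Uni B"))

counts <- toy |> 
  mutate(author = str_trim(author)) |>  # strip leading/trailing whitespace
  count(author, sort = TRUE)
# "Jane Doe, Uni A" now counts as one author with n = 2
```

On the real data, the same `mutate(author = str_trim(author))` before `count()` gives cleaner totals.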

Exercises 2

  1. Use your own cookies and session ID to run the function on the page containing the panel URLs

  2. Check the German news website https://www.zeit.de/. It has an interesting quirk that prevents you from scraping the content of the site. What is it and how could you get around it?

Wrap Up

Save some information about the session for reproducibility.

sessionInfo()
R version 4.3.1 (2023-06-16)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: EndeavourOS

Matrix products: default
BLAS:   /usr/lib/libblas.so.3.11.0 
LAPACK: /usr/lib/liblapack.so.3.11.0

locale:
 [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=nl_NL.UTF-8        LC_COLLATE=en_GB.UTF-8    
 [5] LC_MONETARY=nl_NL.UTF-8    LC_MESSAGES=en_GB.UTF-8   
 [7] LC_PAPER=nl_NL.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=nl_NL.UTF-8 LC_IDENTIFICATION=C       

time zone: Europe/Amsterdam
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] lubridate_1.9.2 forcats_1.0.0   stringr_1.5.0   dplyr_1.1.2    
 [5] purrr_1.0.1     readr_2.1.4     tidyr_1.3.0     tibble_3.2.1   
 [9] ggplot2_3.4.2   tidyverse_2.0.0 httr2_0.2.3     rvest_1.0.3    

loaded via a namespace (and not attached):
 [1] gtable_0.3.3      jsonlite_1.8.7    selectr_0.4-2     compiler_4.3.1   
 [5] tidyselect_1.2.0  xml2_1.3.5        scales_1.2.1      yaml_2.3.7       
 [9] fastmap_1.1.1     R6_2.5.1          generics_0.1.3    curl_5.0.1       
[13] knitr_1.43        munsell_0.5.0     pillar_1.9.0      tzdb_0.4.0       
[17] rlang_1.1.1       utf8_1.2.3        stringi_1.7.12    xfun_0.39        
[21] timechange_0.2.0  cli_3.6.1         withr_2.5.0       magrittr_2.0.3   
[25] digest_0.6.33     grid_4.3.1        rstudioapi_0.15.0 rappdirs_0.3.3   
[29] hms_1.1.3         lifecycle_1.0.3   vctrs_0.6.3       evaluate_0.21    
[33] glue_1.6.2        codetools_0.2-19  fansi_1.0.4       colorspace_2.1-0 
[37] rmarkdown_2.23    httr_1.4.6        tools_4.3.1       pkgconfig_2.0.3  
[41] htmltools_0.5.5